multimodal data fusion
DF-DM: A foundational process model for multimodal data fusion in the artificial intelligence era
Restrepo, David, Wu, Chenwei, Vásquez-Venegas, Constanza, Nakayama, Luis Filipe, Celi, Leo Anthony, López, Diego M
In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information. We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis. These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings.
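The abstract above reports a Macro F1 score and an sMAPE without defining them. The following is a minimal sketch of how both metrics are commonly computed, not the authors' evaluation code. It assumes the percent-scaled sMAPE variant with the mean of absolute values in the denominator (other variants exist), and the function names `smape` and `macro_f1` are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (range 0 to 200).

    Assumed variant: 100/n * sum(|f - a| / ((|a| + |f|) / 2)).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Assumed convention: a pair that is zero on both sides adds no error.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0 by convention here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro averaging weights every class equally, it is a stricter summary than accuracy when classes are imbalanced, which is typical of clinical datasets.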
Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications
Restrepo, David, Wu, Chenwei, Cajas, Sebastián Andrés, Nakayama, Luis Filipe, Celi, Leo Anthony, López, Diego M
Large-scale multi-modal deep learning models have revolutionized domains such as healthcare, highlighting the importance of computational power. However, in resource-constrained regions like Low and Middle-Income Countries (LMICs), limited access to GPUs and data poses significant challenges, often leaving CPUs as the sole resource. To address this, we advocate for leveraging vector embeddings to enable flexible and efficient computational methodologies, democratizing multimodal deep learning across diverse contexts. Our paper investigates the efficiency and effectiveness of using vector embeddings from single-modal foundation models and multi-modal Vision-Language Models (VLMs) for multimodal deep learning in low-resource environments, particularly in healthcare. Additionally, we propose a simple yet effective inference-time method to enhance performance by aligning image-text embeddings. Comparing these approaches with traditional methods, we assess their impact on computational efficiency and model performance using metrics like accuracy, F1-score, inference time, training time, and memory usage across three medical modalities: BRSET (ophthalmology), HAM10000 (dermatology), and SatelliteBench (public health). Our findings show that embeddings reduce computational demands without compromising model performance. Furthermore, our alignment method improves performance in medical tasks. This research promotes sustainable AI practices by optimizing resources in constrained environments, highlighting the potential of embedding-based approaches for efficient multimodal learning. Vector embeddings democratize multimodal deep learning in LMICs, particularly in healthcare, enhancing AI adaptability in varied use cases.
Pathology-and-genomics Multimodal Transformer for Survival Outcome Prediction
Ding, Kexin, Zhou, Mu, Metaxas, Dimitris N., Zhang, Shaoting
Survival outcome assessment in cancer is challenging and inherently associated with multiple clinical factors (e.g., imaging and genomics biomarkers). Enabling multimodal analytics promises to reveal novel predictive patterns of patient outcomes. In this study, we propose a multimodal transformer (PathOmics) that integrates pathology and genomics insights for colon-related cancer survival prediction. We emphasize unsupervised pretraining to capture the intrinsic interaction between tissue microenvironments in gigapixel whole slide images (WSIs) and a wide range of genomics data (e.g., mRNA-sequence, copy number variant, and methylation). After multimodal knowledge aggregation in pretraining, task-specific finetuning extends the model's utility to both multimodal and single-modal data (e.g., image-only or genomics-only). We evaluate our approach on both TCGA colon and rectum cancer cohorts, showing that the proposed approach is competitive and outperforms state-of-the-art studies. Finally, our approach makes effective use of a limited number of finetuning samples, supporting data-efficient analytics for survival outcome prediction. The code is available at https://github.com/Cassie07/PathOmics.
Explaining Multimodal Data Fusion: Occlusion Analysis for Wilderness Mapping
Jointly harnessing complementary features of multimodal input data in a common latent space has long been known to be beneficial. However, the influence of each modality on the model's decision remains a puzzle. This study proposes a deep learning framework for the modality-level interpretation of multimodal earth observation data in an end-to-end fashion. Leveraging an explainable machine learning method, namely Occlusion Sensitivity, the proposed framework investigates the influence of modalities under an early-fusion scenario in which the modalities are fused before the learning process. We show that the task of wilderness mapping benefits substantially from auxiliary data such as land cover and night-time light data.
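The idea of modality-level occlusion under early fusion can be illustrated with a small sketch. This is not the paper's framework: a toy linear scorer stands in for the trained network, occlusion is assumed to mean replacing a modality's features with a constant baseline of zero, and the names `fuse`, `occlusion_sensitivity`, `toy_model` and the modality key `optical` are hypothetical.

```python
def fuse(modalities):
    """Early fusion: concatenate per-modality features into one input vector.

    Also returns the (start, end) slice that each modality occupies.
    """
    fused, spans, start = [], {}, 0
    for name, feats in modalities.items():
        fused.extend(feats)
        spans[name] = (start, start + len(feats))
        start += len(feats)
    return fused, spans


def occlusion_sensitivity(model, modalities, baseline=0.0):
    """Score each modality by the output drop when it is occluded."""
    fused, spans = fuse(modalities)
    reference = model(fused)
    scores = {}
    for name, (lo, hi) in spans.items():
        occluded = list(fused)
        # Replace only this modality's slice with the baseline value.
        occluded[lo:hi] = [baseline] * (hi - lo)
        scores[name] = reference - model(occluded)
    return scores


# Toy stand-in for a trained model: a fixed linear scorer (assumption).
WEIGHTS = [0.5, 0.5, 2.0, 1.0]


def toy_model(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


inputs = {
    "optical": [1.0, 1.0],    # hypothetical primary imagery features
    "land_cover": [1.0],
    "night_lights": [0.5],
}
scores = occlusion_sensitivity(toy_model, inputs)
```

A larger score means the prediction depends more on that modality. With a real network the same loop applies, but the choice of baseline (zeros, mean values, or noise) can change the ranking and should be reported.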
Multimodal Data Fusion based on the Global Workspace Theory
Bao, Cong, Fountas, Zafeirios, Olugbade, Temitayo, Bianchi-Berthouze, Nadia
We propose a novel neural network architecture, named the Global Workspace Network (GWN), that addresses the challenge of dynamic uncertainties in multimodal data fusion. The GWN is inspired by the well-established Global Workspace Theory from cognitive science. We implement it as a model of attention between multiple modalities that evolves through time. The GWN achieved an F1 score of 0.92, averaged over two classes, for discriminating between patients and healthy participants, based on the multimodal EmoPain dataset captured from people with chronic pain and healthy people performing different types of exercise movements in unconstrained settings. In this task, the GWN significantly outperformed a vanilla architecture. It also outperformed the vanilla model in the further classification of three pain levels for a patient (average F1 score = 0.75) based on the EmoPain dataset. We further provide extensive analysis of the behaviour of the GWN and its ability to deal with uncertainty in multimodal data.
Vehicle Tracking Using Surveillance with Multimodal Data Fusion
Zhang, Yue, Song, Bin, Du, Xiaojiang, Guizani, Mohsen
Vehicle location prediction, or vehicle tracking, is a significant topic in connected vehicles. This task, however, is difficult if only single-modal data are available, which can introduce bias and reduce accuracy. With the development of sensor networks in connected vehicles, multimodal data are becoming accessible. Therefore, we propose a framework for vehicle tracking with multimodal data fusion. Images, processed in the vehicle detection module, provide direct information about the features of vehicles, whereas velocity estimation narrows down the possible location of the target vehicles, which reduces the number of features to be compared and decreases time consumption and computational cost. Vehicle detection is designed with a color-faster R-CNN, which takes both the shape and color of the vehicles into consideration. Velocity estimation is performed with the Kalman filter, a classical tracking method. Finally, a multimodal data fusion method is applied to integrate these outcomes so that vehicle-tracking tasks can be achieved. Experimental results suggest the efficiency of our methods, which can track vehicles using a series of surveillance cameras in urban areas.
With technological advancements in vehicles and transportation systems, motorists require comfort and intelligent driving, not only mobility. Thus, there has been a great deal of research, which mainly falls into one of two directions. On one hand, researchers tend to develop more intelligent vehicles, or devices that can be attached to vehicles, bringing up several popular topics such as autonomous vehicles or driverless vehicles [1].
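The velocity-estimation step named in the abstract above relies on the Kalman filter. The sketch below shows a generic one-dimensional constant-velocity Kalman filter that estimates position and velocity from position measurements alone. It is not the authors' implementation: the state layout, the noise values `q` and `r`, the initial covariance, and the function names `kalman_step` and `track` are all assumptions chosen for illustration.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle for state (position, velocity).

    cov is a 2x2 covariance as nested tuples; z is a position measurement.
    q and r are assumed process and measurement noise variances.
    """
    p, v = state
    (p00, p01), (p10, p11) = cov

    # Predict with the constant-velocity model F = [[1, dt], [0, 1]].
    p_pred = p + v * dt
    v_pred = v
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update with the position-only measurement model H = [1, 0].
    residual = z - p_pred
    s = a00 + r                      # innovation variance
    k0, k1 = a00 / s, a10 / s        # Kalman gain
    new_state = (p_pred + k0 * residual, v_pred + k1 * residual)
    new_cov = (
        ((1 - k0) * a00, (1 - k0) * a01),
        (a10 - k1 * a00, a11 - k1 * a01),
    )
    return new_state, new_cov


def track(measurements, dt=1.0):
    """Run the filter over a sequence of position measurements."""
    state = (0.0, 0.0)
    cov = ((100.0, 0.0), (0.0, 100.0))  # large initial uncertainty (assumption)
    for z in measurements:
        state, cov = kalman_step(state, cov, z, dt=dt)
    return state


# A target moving at 2 units per time step, observed without noise.
positions = [2.0 * t for t in range(10)]
est_position, est_velocity = track(positions)
```

The predicted position can be used to restrict where the detector searches in the next frame, which is the cost reduction the abstract describes. A camera-based tracker would typically extend the state to two image or ground-plane dimensions.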